17. Under the Hood Part 2
What's the backward pass doing?
Now that we've re-established how the variables flow, let's walk through the steps of the backward pass more explicitly.
Matching Equations 1
QUIZ QUESTION:
Match the partial derivatives below for \Large \frac{\partial G}{\partial W} with their related equations.
ANSWER CHOICES:

| Partial derivative | Equation |
|---|---|
| \large g_2 \cdot x_{22} | |
| \large g_2 \cdot x_{21} | |
| \large g_1 \cdot x_{11} | |
| \large g_1 \cdot x_{12} | |
Calculate \large \frac{\partial G}{\partial W}
We'll start by calculating \large \frac{\partial G}{\partial W}, taking each data point's gradient with respect to each weight (as you did above in the quiz!):
- For data point 1:
- \large \frac{\partial g_1}{\partial w_1} = g_1 \cdot x_{11}
- \large \frac{\partial g_1}{\partial w_2} = g_1 \cdot x_{12}
- For data point 2:
- \large \frac{\partial g_2}{\partial w_1} = g_2 \cdot x_{21}
- \large \frac{\partial g_2}{\partial w_2} = g_2 \cdot x_{22}
Now, sum over the two data points for each weight:
- \large \frac{\partial G}{\partial w_1} = \frac{\partial g_1}{\partial w_1} + \frac{\partial g_2}{\partial w_1} = g_1 \cdot x_{11} + g_2 \cdot x_{21}
- \large \frac{\partial G}{\partial w_2} = \frac{\partial g_1}{\partial w_2} + \frac{\partial g_2}{\partial w_2} = g_1 \cdot x_{12} + g_2 \cdot x_{22}
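The sums above can be sketched numerically. This is a minimal example with made-up values for the inputs and gradients (the specific numbers are assumptions for illustration, not from the lesson):

```python
# Assumed example values: two data points, two features each.
x11, x12 = 1.0, 2.0   # data point 1
x21, x22 = 3.0, 4.0   # data point 2
g1, g2 = 0.5, -1.0    # upstream gradient for each data point

# Sum each weight's contribution over both data points.
dG_dw1 = g1 * x11 + g2 * x21   # 0.5*1.0 + (-1.0)*3.0 = -2.5
dG_dw2 = g1 * x12 + g2 * x22   # 0.5*2.0 + (-1.0)*4.0 = -3.0
print(dG_dw1, dG_dw2)
```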
Converting to matrix multiplication
Let's look back at our matrices from before to show how the above equations work with matrix multiplication.
Matrix Multiplication
In matrix form, the equations from above equate to the following:
\frac{\partial G}{\partial W} = X.T \cdot G, where:
X.T = \begin{bmatrix}x_{11} & x_{21}\\x_{12} & x_{22}\end{bmatrix}
G = \begin{bmatrix}g_1\\g_2\end{bmatrix}
\frac{\partial G}{\partial W} = \begin{bmatrix}g_1 \cdot x_{11} + g_2 \cdot x_{21}\\g_1 \cdot x_{12} + g_2 \cdot x_{22}\end{bmatrix}
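You can verify the matrix form with NumPy. The values below are the same illustrative numbers as before (assumed for the example):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # rows are data points, columns are features
G = np.array([[0.5],
              [-1.0]])       # one upstream gradient per data point

# Matrix form of the per-weight sums: X.T @ G has one entry per weight.
dG_dW = X.T @ G
print(dG_dW.ravel())         # matches the elementwise sums: [-2.5, -3.0]
```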
Calculate \large \frac{\partial G}{\partial X}
We're not done yet! We still need to calculate the partial derivative of the gradient with respect to our input, X.
- For data point 1:
- \large \frac{\partial g_1}{\partial x_{11}} = g_1 \cdot w_1
- \large \frac{\partial g_1}{\partial x_{12}} = g_1 \cdot w_2
- For data point 2:
- \large \frac{\partial g_2}{\partial x_{21}} = g_2 \cdot w_1
- \large \frac{\partial g_2}{\partial x_{22}} = g_2 \cdot w_2
And to convert this for matrix multiplication:
\frac{\partial G}{\partial X} = G \cdot W.T, where:
G = \begin{bmatrix}g_1\\g_2\end{bmatrix}
W.T = \begin{bmatrix}w_1 & w_2\end{bmatrix}
\frac{\partial G}{\partial X} = \begin{bmatrix}g_1 \cdot w_1 & g_1 \cdot w_2\\g_2 \cdot w_1 & g_2 \cdot w_2\end{bmatrix}
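The input gradient works the same way in NumPy. Again, the gradient and weight values here are assumptions chosen just to make the shapes concrete:

```python
import numpy as np

G = np.array([[0.5],
              [-1.0]])       # one upstream gradient per data point
W = np.array([[0.1],
              [0.2]])        # one weight per feature (column vector)

# G @ W.T produces a matrix the same shape as X:
# row i is g_i * [w1, w2], matching the per-element derivatives above.
dG_dX = G @ W.T
print(dG_dX)
```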
That's it!
You've now calculated the partial derivative we need to update our weights, as well as the partial derivative with respect to the input. If another layer preceded this one, that input gradient is what we would pass to it (i.e., our gradient with respect to X would become the upstream gradient for the next layer back in the network architecture).